Instance Driven Hierarchical Clustering of Document Collections

Authors

  • Hassan H. Malik
  • John R. Kender
Abstract

The global pattern mining step in existing pattern-based hierarchical clustering algorithms may result in an unpredictable number of patterns. In this paper, we propose IDHC, a pattern-based hierarchical clustering algorithm that builds a cluster hierarchy without mining for globally significant patterns. IDHC allows each instance to "vote" for its representative size-2 patterns in a way that ensures an effective balance between local and global pattern significance. The number of patterns selected for each instance is dynamically determined using a local standard deviation based scheme, and the rest of the cluster hierarchy is obtained by following a unique iterative cluster refinement process. By effectively utilizing instance-to-cluster relationships, this process directly identifies clusters for each level in the hierarchy, and efficiently prunes duplicate clusters. Furthermore, IDHC produces cluster labels that are more descriptive (patterns are not artificially restricted), and adopts a soft clustering scheme that allows instances to exist in suitable nodes at various levels in the cluster hierarchy. We present results of experiments performed on 16 standard text datasets, and show that IDHC almost always outperforms state-of-the-art hierarchical clustering algorithms in terms of entropy, and achieves better FScores in most cases, without requiring tuning of parameter values.
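The per-instance voting step described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state the scoring function or the exact selection rule, so everything specific here is an assumption made for illustration and not the authors' method: the score (minimum local term frequency times the fraction of documents containing the pair), the threshold (mean plus one population standard deviation of the instance's own scores), and the helper names `select_patterns` and `initial_clusters`.

```python
from collections import Counter, defaultdict
from itertools import combinations
from statistics import mean, pstdev


def select_patterns(docs):
    """Pick size-2 patterns per document.

    docs: list of dicts mapping term -> local frequency.
    Returns one sorted list of (term_a, term_b) pairs per document.
    """
    # Global support: in how many documents does each term pair co-occur?
    support = Counter()
    for doc in docs:
        for pair in combinations(sorted(doc), 2):
            support[pair] += 1

    n_docs = len(docs)
    selected = []
    for doc in docs:
        scores = {}
        for a, b in combinations(sorted(doc), 2):
            local = min(doc[a], doc[b])           # assumed local significance
            global_ = support[(a, b)] / n_docs    # assumed global significance
            scores[(a, b)] = local * global_
        if not scores:
            selected.append([])
            continue
        values = list(scores.values())
        # Assumed rule: keep pairs scoring at least one standard deviation
        # above this document's own mean, so the count varies per document.
        threshold = mean(values) + pstdev(values)
        selected.append(sorted(p for p, s in scores.items() if s >= threshold))
    return selected


def initial_clusters(selected):
    """Group document indices by the patterns they voted for."""
    clusters = defaultdict(set)
    for index, patterns in enumerate(selected):
        for pattern in patterns:
            clusters[pattern].add(index)
    return dict(clusters)


docs = [
    {"cluster": 3, "pattern": 3, "text": 1},
    {"cluster": 2, "pattern": 2, "mining": 1},
    {"image": 2, "video": 2, "text": 1},
]
selected = select_patterns(docs)
clusters = initial_clusters(selected)
print(clusters)
```

In this toy input the first two documents vote for the same pair and therefore share a cluster, while the third forms its own. The iterative refinement, duplicate pruning, and soft assignment mentioned in the abstract are not covered by this sketch.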

Similar resources

On the Comparison of Semi-Supervised Hierarchical Clustering Algorithms in Text Mining Tasks

Semi-supervised clustering approaches have emerged as an option for enhancing clustering results. These algorithms use external information to guide the clustering process. In particular, semi-supervised hierarchical clustering approaches have been explored in many fields in recent years. These algorithms provide efficient and personalized hierarchical overviews of datasets. To the best of th...

Hierarchical Clustering in Medical Document Collections: the BIC-Means Method

Hierarchical clustering of text collections is a key problem in document management and retrieval. In partitional hierarchical clustering, which is more efficient than its agglomerative counterpart, the entire collection is split into clusters and the individual clusters are further split until a heuristically-motivated termination criterion is met. In this paper, we define the BIC-means algori...

A Probabilistic Hierarchical Clustering Method for Organising Collections of Text Documents

In this paper a generic probabilistic framework for the unsupervised hierarchical clustering of large-scale sparse high-dimensional data collections is proposed. The framework is based on a hierarchical probabilistic mixture methodology. Two classes of models emerge from the analysis and these have been termed as symmetric and asymmetric models. For text data specifically both asymmetric and sy...

Document Clustering Based on Ontology and a Fuzzy Approach

Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into different groups. The most important step...

Incremental and Hierarchical Document Clustering

Over the past few decades, the volume of existing text data has increased exponentially. Automatic tools to organize these huge collections of documents are becoming increasingly important. Document clustering is important for automatically organizing documents into clusters. Most clustering algorithms process document collections as a whole; however, it is important to process these docu...

Publication date: 2008